{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Text Search Engine\n",
    "In this example we will be going over the code used to build a Text Search Engine. This example uses a modified BERT model to convert text to vectors stored in Milvus, which can then be combined with Milvus to search for similar text to the user input text."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data\n",
    "\n",
    "This example uses the English News dataset. In this example, we use a small subset of the dataset containing 180 mutually corresponding title-texts, which can be found in the **Data** directory."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Requirements\n",
    "\n",
    "\n",
    "|  Packages   |  Servers    |\n",
    "|-                  | -                 |   \n",
    "| pymilvus==2.0.0rc5      | milvus 2.0.0-rc5    |\n",
    "| sentence_transformers      | postgres          |\n",
    "| psycopg2          |\n",
    "| pandas           |\n",
    "| numpy   |\n",
    "\n",
    "We have included a `requirements.txt` file in order to easily satisfy the required packages. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Up and Running"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Installing Packages\n",
    "Install the required python packages with `requirements.txt`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "pip install -r requirements.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Starting Milvus Server\n",
    "\n",
    "This demo uses Milvus 2.0, please refer to the [Install Milvus](https://milvus.io/cn/docs/install_standalone-docker.md) guide to learn how to use this docker container. For this example we wont be mapping any local volumes. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget https://raw.githubusercontent.com/milvus-io/milvus/master/deployments/docker/standalone/docker-compose.yml -O docker-compose.yml\n",
    "!sudo docker-compose up -d"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Starting Postgres Server\n",
    "For now, Milvus doesn't support storing multiple attributes for the data. Because of this we have to use another service to store these attributes and search through them, in this case PostgreSQL. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "! docker run --name postgres0 -d  -p 5438:5432 -e POSTGRES_HOST_AUTH_METHOD=trust postgres"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "! docker logs postgres0 --tail 6"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Code Overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Connecting to Servers\n",
    "We first start off by connecting to the servers. In this case the docker containers are running on localhost and the ports are the default ports. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Connectings to Milvus, BERT and Postgresql\n",
    "from pymilvus import connections\n",
    "import psycopg2\n",
    "connections.connect(host='localhost', port='19530')\n",
    "conn = psycopg2.connect(host='localhost', port='5438', user='postgres', password='postgres')\n",
    "cursor = conn.cursor()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating Collection and Setting Index\n",
    "#### 1. Creating the Collection    \n",
    "The next step is to create a collection, which requires declaring the name of the collection and the dimension of the vector."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "TABLE_NAME = \"text_collection\"\n",
    "field_name = \"example_field\"\n",
    "from pymilvus import Collection, CollectionSchema, FieldSchema, DataType\n",
    "pk = FieldSchema(name=\"id\", dtype=DataType.INT64, is_primary=True, auto_id=True)\n",
    "field = FieldSchema(name=field_name, dtype=DataType.FLOAT_VECTOR, dim=768)\n",
    "schema = CollectionSchema(fields=[pk,field], description=\"example collection\")\n",
    "collection = Collection(name=TABLE_NAME, schema=schema)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. Setting an Index\n",
    "After creating the collection we want to assign it an index type. This can be done before or after inserting the data. When done before, indexes will be made as data comes in and fills the data segments. In this example we are using IVF_SQ8 which requires the 'nlist' parameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Status(code=0, message='')"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "index_param = {\n",
    "        \"metric_type\":\"L2\",\n",
    "        \"index_type\":\"IVF_SQ8\",\n",
    "        \"params\":{\"nlist\":1024}\n",
    "    }\n",
    "collection.create_index(field_name=field_name, index_params=index_param)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating Table in Postgres  \n",
    "PostgresSQL will be used to store Milvus ID and its corresponding title and text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "create postgres table successfully!\n"
     ]
    }
   ],
   "source": [
    "#Deleting previouslny stored table for clean run\n",
    "drop_table = \"DROP TABLE IF EXISTS \" + TABLE_NAME\n",
    "cursor.execute(drop_table)\n",
    "conn.commit()\n",
    "\n",
    "try:\n",
    "    sql = \"CREATE TABLE if not exists \" + TABLE_NAME + \" (ids bigint, title text, text text);\"\n",
    "    cursor.execute(sql)\n",
    "    conn.commit()\n",
    "    print(\"create postgres table successfully!\")\n",
    "except Exception as e:\n",
    "    print(\"can't create a postgres table: \", e)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Processing and Storing the News Data\n",
    "#### 1. Generating Embeddings\n",
    "In this example we are using the sentence_transformer library  to encode the sentence into vectors. This library uses a modified BERT model to generate the embeddings, and in this example we are using a model pretrained using Microsoft's `mpnet`. More info can be found [here](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "You try to use a model that was created with version 1.2.0, however, your version is 1.1.1. This might cause unexpected behavior or errors. In that case, try to update to the latest version.\n",
      "\n",
      "\n",
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'numpy.ndarray'>\n"
     ]
    }
   ],
   "source": [
    "from sentence_transformers import SentenceTransformer\n",
    "import pandas as pd\n",
    "from sklearn.preprocessing import normalize\n",
    "\n",
    "model = SentenceTransformer('paraphrase-mpnet-base-v2')\n",
    "# Get questions and answers.\n",
    "data = pd.read_csv('data/example.csv')\n",
    "title_data = data['title'].tolist()\n",
    "text_data = data['text'].tolist()\n",
    "\n",
    "sentence_embeddings = model.encode(title_data)\n",
    "sentence_embeddings = normalize(sentence_embeddings)\n",
    "print(type(sentence_embeddings))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. Inserting Vectors into Milvus\n",
    "Since this example dataset contains only 100 vectors, we are inserting all of them as one batch insert."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "em =list(sentence_embeddings)\n",
    "mr = collection.insert([em])\n",
    "ids = mr.primary_keys\n",
    "dicts ={}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3. Inserting IDs and Title-text into PostgreSQL\n",
    "In order to transfer the data into Postgres, we are creating a new file that combines all the data into a readable format. Once created, we pass this file into the Postgress server through STDIN due to the Postgres container not having access to the file locally. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Inserted into Postgress Sucessfully!\n"
     ]
    }
   ],
   "source": [
    "import os \n",
    "\n",
    "def record_temp_csv(fname, ids, title, text):\n",
    "    with open(fname,'w') as f:\n",
    "        for i in range(len(ids)):\n",
    "            line = str(ids[i]) + \"|\" + title[i] + \"|\" + text[i] + \"\\n\"\n",
    "            f.write(line)\n",
    "\n",
    "def copy_data_to_pg(table_name, fname, conn, cur):\n",
    "    fname = os.path.join(os.getcwd(),fname)\n",
    "    try:\n",
    "        sql = \"COPY \" + table_name + \" FROM STDIN DELIMITER '|' CSV HEADER\"\n",
    "        cursor.copy_expert(sql, open(fname, \"r\"))\n",
    "        conn.commit()\n",
    "        print(\"Inserted into Postgress Sucessfully!\")\n",
    "    except Exception as e:\n",
    "        print(\"Copy Data into Postgress failed: \", e)\n",
    "        \n",
    "DATA_WITH_IDS = 'data/test.csv'   \n",
    "\n",
    "record_temp_csv(DATA_WITH_IDS, ids, title_data, text_data)\n",
    "copy_data_to_pg(TABLE_NAME, DATA_WITH_IDS, conn, cursor)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Search\n",
    "#### 1. Processing Query\n",
    "When searching for a question, we first put the question through the same model to generate an embedding. Then with that embedding vector we  can search for similar embeddings in Milvus.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "search_params = {\"metric_type\": \"L2\", \"params\": {\"nprobe\": 10}}\n",
    "\n",
    "query_vec = []\n",
    "\n",
    "title = \"Loosing the War on Terrorism\"\n",
    "\n",
    "query_embeddings = []\n",
    "embed = model.encode(title)\n",
    "embed = embed.reshape(1,-1)\n",
    "embed = normalize(embed)\n",
    "query_embeddings = embed.tolist()\n",
    "\n",
    "collection.load()\n",
    "results = collection.search(query_embeddings, field_name, param=search_params, limit=9, expr=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. Getting the Similar Titles\n",
    "There may not have titles that are similar to the given one. So we can set a threshold value, here we use 0.5, and when the most similar distance retrieved is less than this value, a hint that the system doesn't include the relevant question is returned. We then use the result ID's to pull out the similar titles from the Postgres server and print them with their corresponding similarity score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are no similar questions in the database, here are the closest matches:\n",
      "('Loosing the War on Terrorism', 4.2543743593130567e-13)\n",
      "('Politics an Afterthought Amid Hurricane  ', 1.4113149642944336)\n",
      "('Kerry-Kerrey Confusion Trips Up Campaign  ', 1.4152636528015137)\n",
      "('News: Sluggish movement on power grid cyber security', 1.4480023384094238)\n",
      "('Promoting a Shared Vision', 1.4696528911590576)\n",
      "('U.S. Brokers Cease-fire in Western Afghanistan', 1.473806619644165)\n",
      "('On front line of AIDS in Russia', 1.493909478187561)\n",
      "('Fresh Fighting Shatters Short-Lived Ceasefire Deal', 1.5365568399429321)\n",
      "('Flop in the ninth inning sinks Jays', 1.568418264389038)\n"
     ]
    }
   ],
   "source": [
    "similar_titles = []\n",
    "\n",
    "if results[0][0].distance < 0.5:\n",
    "    print(\"There are no similar questions in the database, here are the closest matches:\")\n",
    "else:\n",
    "    print(\"There are similar questions in the database, here are the closest matches: \")\n",
    "    \n",
    "for result in results[0]:\n",
    "    sql = \"select title from \" + TABLE_NAME + \" where ids = \" + str(result.id) + \";\"\n",
    "    cursor.execute(sql)\n",
    "    rows=cursor.fetchall()\n",
    "    if len(rows):\n",
    "        similar_titles.append((rows[0][0], result.distance))\n",
    "        print((rows[0][0], result.distance))\n",
    "       "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3. Get the text\n",
    "After getting a list of similar titles, choose the one that you feel is closest to yours. Then you can use that title to find the corresponding text in Postgres."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Title:\n",
      "Loosing the War on Terrorism\n",
      "Text:\n",
      " Sven Jaschan, self-confessed author of the Netsky and Sasser viruses, is responsible for 70 percent of virus infections in 2004, according to a six-month virus roundup published Wednesday by antivirus company Sophos.  The 18-year-old Jaschan was taken into custody in Germany in May by police who said he had admitted programming both the Netsky and Sasser worms, something experts at Microsoft confirmed. (A Microsoft antivirus reward program led to the teenager's arrest.) During the five months preceding Jaschan's capture, there were at least 25 variants of Netsky and one of the port-scanning network worm Sasser.  Graham Cluley, senior technology consultant at Sophos, said it was staggeri   \n"
     ]
    }
   ],
   "source": [
    "sql = \"select text from \" + TABLE_NAME + \" where title = '\" + similar_titles[0][0] + \"';\"\n",
    "cursor.execute(sql)\n",
    "rows=cursor.fetchall()\n",
    "print(\"Title:\")\n",
    "print(title)\n",
    "print(\"Text:\")\n",
    "print(rows[0][0])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
